Identification of cytotoxicity-related genes in breast cancer using RNA-seq

Marina Sangés Ametllé (s223690)

What is RNA-seq?

OUR DATA

Our data consists of RNA-seq reads normalized as FPKM (Fragments Per Kilobase of transcript per Million mapped reads). This normalization corrects for gene length and sequencing depth, which allows us to quantify gene expression levels.
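As a minimal sketch of what the FPKM normalization computes, the function below applies the standard formula to a single gene. The function name and the example numbers are hypothetical and are not taken from our dataset.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """FPKM = fragments * 1e9 / (gene length in bp * total mapped fragments).

    The factor 1e9 combines "per kilobase" (1e3) and "per million" (1e6).
    """
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene length and library size must be positive")
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


# Hypothetical gene: 500 fragments on a 2,000 bp gene,
# in a library of 10 million mapped fragments.
example_fpkm = fpkm(500, 2000, 10_000_000)
print(example_fpkm)
```

Dividing by gene length keeps long genes from looking more expressed only because they collect more fragments, and dividing by library size accounts for samples sequenced at different depths.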

The data is composed of 60 breast cell samples in total: 30 tumor and 30 normal.

For each sample, the expression of 20246 genes is assessed.

Final dataset size: 20246 observations of 60 variables.

GOAL

The main goal of this analysis is to identify the differentially expressed genes between tumor and normal breast cells and the biological processes they affect.

CLEANING

  • Change variable names to more understandable ones.
  • Check for null values.
  • Check for non-valid samples.
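The first two cleaning steps can be sketched as below, assuming the table is held as a dictionary of sample name to expression values. The raw column names (S01, S02, S03), the group labels and the helper names are hypothetical. Only the tumor_repN / normal_repN naming scheme comes from our data.

```python
import math


def rename_samples(raw_names, labels):
    """Map each raw column name to '<label>_rep<N>', numbering within each group."""
    counters = {}
    mapping = {}
    for raw, label in zip(raw_names, labels):
        counters[label] = counters.get(label, 0) + 1
        mapping[raw] = f"{label}_rep{counters[label]}"
    return mapping


def count_missing(table):
    """Count None or NaN entries per sample."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))
    return {sample: sum(is_missing(v) for v in values)
            for sample, values in table.items()}


# Hypothetical raw table: three samples, three genes each.
raw_table = {
    "S01": [1.2, 0.0, None],
    "S02": [3.4, float("nan"), 0.5],
    "S03": [0.0, 2.2, 7.1],
}
mapping = rename_samples(list(raw_table), ["tumor", "tumor", "normal"])
clean_table = {mapping[name]: values for name, values in raw_table.items()}
missing = count_missing(clean_table)
print(mapping)
print(missing)
```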

CLEANING: CHECK FOR NON-VALID SAMPLES

We will keep all the samples, but will watch normal_rep14, normal_rep6 and tumor_rep3 closely, since they show expression in fewer genes than the rest of the samples.
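One way to spot such samples is to count the genes with non-zero expression in each sample and flag the ones that fall well below the median. This is only a sketch: the cutoff (80% of the median), the helper names and the toy values are assumptions, not the criterion used in this analysis.

```python
from statistics import median


def expressed_gene_counts(table, min_fpkm=0.0):
    """Number of genes per sample with FPKM above min_fpkm."""
    return {sample: sum(value > min_fpkm for value in values)
            for sample, values in table.items()}


def flag_low_samples(counts, fraction=0.8):
    """Samples whose count is below fraction * median count (hypothetical rule)."""
    cutoff = fraction * median(counts.values())
    return sorted(sample for sample, n in counts.items() if n < cutoff)


# Toy table with five genes per sample (values are made up).
toy_table = {
    "normal_rep1": [5.0, 2.1, 0.3, 8.8, 1.0],
    "normal_rep2": [4.2, 0.0, 0.9, 7.5, 2.3],
    "tumor_rep1": [0.0, 0.0, 0.0, 6.1, 0.4],
    "tumor_rep2": [3.3, 1.8, 0.0, 9.0, 1.1],
}
counts = expressed_gene_counts(toy_table)
flagged = flag_low_samples(counts)
print(counts)
print(flagged)
```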

DATA EXPLORATION: PCA

There is a good separation of the two groups, but some samples do not follow the pattern. Are these the samples we flagged during cleaning?
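To illustrate how a PCA separates samples, the sketch below finds the first principal component with power iteration in plain Python. It is not the procedure used for the plot. A real analysis would typically rely on a library such as scikit-learn. The toy matrix (4 samples, 3 genes) and the function name are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) of PC1 for a samples x genes matrix."""
    n, p = len(rows), len(rows[0])
    # Center each gene (column) on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    # Deterministic, non-uniform start vector, normalized to unit length.
    start = [float(j + 1) for j in range(p)]
    norm = math.sqrt(sum(x * x for x in start))
    v = [x / norm for x in start]
    for _ in range(iterations):
        # Power iteration on X^T X without forming it: v <- X^T (X v).
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        new_v = [sum(centered[i][j] * scores[i] for i in range(n))
                 for j in range(p)]
        norm = math.sqrt(sum(x * x for x in new_v))
        if norm == 0.0:
            break
        v = [x / norm for x in new_v]
    scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
    return v, scores


# Toy data: two tumor samples followed by two normal samples.
toy_matrix = [
    [8.0, 1.0, 5.0],  # tumor_rep1
    [7.0, 2.0, 5.5],  # tumor_rep2
    [2.0, 7.0, 5.0],  # normal_rep1
    [1.0, 8.0, 4.5],  # normal_rep2
]
loadings, pc1_scores = first_principal_component(toy_matrix)
print(pc1_scores)
```

In this toy example the two tumor samples get PC1 scores of one sign and the two normal samples get scores of the opposite sign, which is the kind of separation seen in the plot.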

RESULTS: HEATMAP

RESULTS: VOLCANO PLOT
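
A volcano plot places each gene at (log2 fold change, -log10 p-value). The sketch below computes those coordinates and labels each gene as up, down or not significant. The pseudocount, the cutoffs (|log2FC| >= 1, p < 0.05), the gene names and the p-values are all hypothetical. The p-values are assumed to come from a separate differential expression test.

```python
import math


def log2_fold_change(tumor_values, normal_values, pseudocount=1.0):
    """log2 of mean tumor over mean normal expression, with a pseudocount."""
    mean_tumor = sum(tumor_values) / len(tumor_values)
    mean_normal = sum(normal_values) / len(normal_values)
    return math.log2((mean_tumor + pseudocount) / (mean_normal + pseudocount))


def volcano_point(log2fc, p_value, lfc_cutoff=1.0, p_cutoff=0.05):
    """Return (x, y, label) for one gene on a volcano plot."""
    y = -math.log10(max(p_value, 1e-300))  # guard against log10(0)
    if p_value < p_cutoff and log2fc >= lfc_cutoff:
        label = "up"
    elif p_value < p_cutoff and log2fc <= -lfc_cutoff:
        label = "down"
    else:
        label = "not significant"
    return log2fc, y, label


# Made-up genes: (tumor FPKM values, normal FPKM values, p-value).
toy_genes = {
    "gene_A": ([7.0, 7.0], [1.0, 1.0], 0.001),
    "gene_B": ([0.0, 0.0], [3.0, 3.0], 0.01),
    "gene_C": ([3.0, 3.0], [3.0, 3.0], 0.5),
}
points = {
    name: volcano_point(log2_fold_change(tumor, normal), p)
    for name, (tumor, normal, p) in toy_genes.items()
}
labels = {name: point[2] for name, point in points.items()}
print(labels)
```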

RESULTS: GENE ENRICHMENT
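
A common way to test enrichment is an over-representation test based on the hypergeometric distribution. It asks how likely it is to see at least the observed overlap between the DE genes and a gene set by chance. The sketch below is one such test and not necessarily the method behind these results. The gene set size (100) and the overlap (10) are hypothetical. The universe (20246 genes) and the number of DE genes (385) come from our data.

```python
from math import comb


def enrichment_p_value(n_universe, n_in_set, n_de, n_overlap):
    """P(X >= n_overlap) for X ~ Hypergeometric(n_universe, n_in_set, n_de)."""
    upper = min(n_in_set, n_de)
    total = comb(n_universe, n_de)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Small check that can be verified by hand: 10 genes, set of 4, 3 DE, overlap >= 2.
small_p = enrichment_p_value(10, 4, 3, 2)

# Hypothetical gene set of 100 genes, 10 of which are among the 385 DE genes.
example_p = enrichment_p_value(20246, 100, 385, 10)
print(small_p, example_p)
```

When many gene sets are tested, the resulting p-values would normally be corrected for multiple testing before being interpreted.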

CONCLUSIONS

  1. All the samples have a similar number of expressed genes, except three of them.
  2. Tumor and normal cells separate well in a PCA plot.
  3. There is a total of 385 differentially expressed (DE) genes between tumor and normal breast cells, so the two groups differ in expression.
  4. More genes are downregulated than upregulated in tumor versus normal cells.
  5. Gene enrichment showed that the differentially expressed genes especially affect keratin-related processes, cell division and cell-cell interaction.

THANK YOU FOR YOUR ATTENTION!
